Assignment 6: Multi-class classification

For this assignment, we will look in more detail at the per-digit accuracy.

The extra credit examines how much data one needs to do classification (at least in this simple case).

Task 1: Get digits and define subsamples

Here I want you to read in the digit data and define two samples: a training sample (X, y) with 90% of the data, and a holdout sample (X_holdout, y_holdout) with the remaining 10%.

Remember to shuffle the data before doing the 90/10 split!
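A minimal sketch of this step, assuming scikit-learn's built-in digits dataset; the variable names (X, y, X_holdout, y_holdout) match those used later in the assignment:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X_all, y_all = digits.data, digits.target

# Shuffle before splitting so the holdout sample is representative.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X_all))
X_all, y_all = X_all[idx], y_all[idx]

# 90/10 split: (X, y) for training and k-folds; (X_holdout, y_holdout) set aside.
n_train = int(0.9 * len(X_all))
X, y = X_all[:n_train], y_all[:n_train]
X_holdout, y_holdout = X_all[n_train:], y_all[n_train:]
```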

Copy methods from other notebooks

Get the following methods:

Task 2: Run kfold validation training and find the average accuracy per digit

Here we want to loop over the k-folds (just as we did in multiclassv2), but this time calculate the average per-digit accuracy over the folds.

So for each fold, you need to calculate the test accuracy for each digit (0, 1, 2, ..., 9), then average these over the 5 folds. You can get the accuracy for each digit within each fold from the cf_test confusion matrix:

Store these in a dictionary (with key the true class) called "accuracies_by_digit".
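A sketch of this loop, assuming LogisticRegression stands in for whatever classifier your runFitter method wraps, and building cf_test with scikit-learn's confusion_matrix:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_digits(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

per_fold = []  # one length-10 array of per-digit accuracies per fold
for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression(max_iter=2000)
    clf.fit(X[train_idx], y[train_idx])
    cf_test = confusion_matrix(y[test_idx], clf.predict(X[test_idx]),
                               labels=range(10))
    # Per-digit accuracy: diagonal element / row sum (rows are the true class).
    per_fold.append(np.diag(cf_test) / cf_test.sum(axis=1))

# Average over the 5 folds, keyed by the true class.
accuracies_by_digit = {d: float(np.mean([f[d] for f in per_fold]))
                       for d in range(10)}
```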

Task 3: Train with the full sample, evaluate with the holdout sample

Now, using the X and y samples with no k-folds, train on the full sample and use the X_holdout and y_holdout data as your "test" data. You can do this with a single call to runFitter.

Then calculate the per-digit accuracy for this case (again, you can use the confusion matrix from the holdout sample to calculate this).
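A sketch of this step: a single fit on the full training sample, evaluated on the holdout. A plain LogisticRegression fit stands in here for the single runFitter call:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Rebuild the shuffled 90/10 split from Task 1.
X_all, y_all = load_digits(return_X_y=True)
rng = np.random.default_rng(42)
idx = rng.permutation(len(X_all))
X_all, y_all = X_all[idx], y_all[idx]
n_train = int(0.9 * len(X_all))
X, y = X_all[:n_train], y_all[:n_train]
X_holdout, y_holdout = X_all[n_train:], y_all[n_train:]

# One fit on the full training sample; the holdout plays the role of test data.
clf = LogisticRegression(max_iter=2000).fit(X, y)
cf_holdout = confusion_matrix(y_holdout, clf.predict(X_holdout),
                              labels=range(10))

# Per-digit accuracy on the holdout: diagonal element / row sum.
holdout_acc_by_digit = {d: cf_holdout[d, d] / cf_holdout[d].sum()
                        for d in range(10)}
```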

Compare these accuracies to those obtained from the average of the k-folds, in the following way:

  1. Print them in a simple table
  2. Plot them one against the other (labeling each digit by color).
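The table and scatter comparison can be sketched as below. The accuracy values here are made-up placeholders; substitute the accuracies_by_digit dictionary from Task 2 and your holdout values from Task 3:

```python
# Placeholder dictionaries standing in for the real Task 2 / Task 3 results.
accuracies_by_digit = {d: 0.92 + 0.006 * d for d in range(10)}   # k-fold averages (made up)
holdout_acc_by_digit = {d: 0.91 + 0.007 * d for d in range(10)}  # holdout values (made up)

# 1. Simple table.
print(f"{'digit':>5} {'kfold':>8} {'holdout':>8}")
for d in range(10):
    print(f"{d:>5} {accuracies_by_digit[d]:>8.3f} {holdout_acc_by_digit[d]:>8.3f}")

# 2. Scatter of one against the other; each digit gets its own scatter call,
# so matplotlib assigns each a distinct color automatically.
try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for d in range(10):
        ax.scatter(accuracies_by_digit[d], holdout_acc_by_digit[d], label=str(d))
    ax.plot([0.9, 1.0], [0.9, 1.0], "k--", lw=0.5)  # y = x guide line
    ax.set_xlabel("k-fold average accuracy")
    ax.set_ylabel("holdout accuracy")
    ax.legend(title="digit", ncol=2)
    fig.savefig("accuracy_comparison.png")
except ImportError:
    print("matplotlib not available; skipping the plot")
```

Points above the y = x line are digits that do better on the holdout than in cross-validation, and vice versa.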

Extra Credit 1: Another Performance Plot

First, print out the holdout confusion matrix from Task 3.

Note that the accuracy for each digit is the diagonal element for each row, divided by the sum of all of the numbers in that row.

What do the columns tell us? We will define the "fake ratio" as the sum of all of the non-diagonal terms in each column, divided by the diagonal element of that column.

Calculate this "fake ratio", and then:

  1. Print the accuracy and "fake ratio" for each digit in a simple table
  2. Plot them one against the other (labeling each digit by color).
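Both quantities fall straight out of the confusion matrix. A sketch with a small made-up 3x3 matrix (substitute your 10x10 holdout matrix from Task 3):

```python
import numpy as np

# Small illustrative confusion matrix (rows = true class, columns = predicted).
cf = np.array([[48,  1,  1],
               [ 2, 45,  3],
               [ 0,  4, 46]])

# Accuracy per digit: diagonal element / row sum.
accuracy = np.diag(cf) / cf.sum(axis=1)

# Fake ratio per digit: off-diagonal column sum / diagonal element, i.e. how
# often a predicted class is "faked" by examples of other true classes.
fake_ratio = (cf.sum(axis=0) - np.diag(cf)) / np.diag(cf)

for d in range(len(cf)):
    print(f"{d}: accuracy={accuracy[d]:.3f}  fake_ratio={fake_ratio[d]:.3f}")
```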

Extra Credit 2: How little data do we need?

We usually focus on getting a lot of data for our training. But how many examples of each digit do we really need?

For this exercise, let's do the following:

Plot the average micro accuracy versus training-sample size, for both the test and training sets, on the same plot.
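One way to sketch this: inside each k-fold, truncate the training set to n examples, fit, and record the overall (micro) accuracy on both the truncated training set and the fold's test set. The sample sizes and the LogisticRegression stand-in are assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
sizes = [50, 100, 200, 400, 800]  # assumed grid of training-sample sizes
kf = KFold(n_splits=5, shuffle=True, random_state=0)

train_curve, test_curve = [], []
for n in sizes:
    tr_accs, te_accs = [], []
    for train_idx, test_idx in kf.split(X):
        sub = train_idx[:n]  # keep only the first n training examples
        clf = LogisticRegression(max_iter=2000).fit(X[sub], y[sub])
        # .score returns the fraction of correct predictions = micro accuracy.
        tr_accs.append(clf.score(X[sub], y[sub]))
        te_accs.append(clf.score(X[test_idx], y[test_idx]))
    train_curve.append(np.mean(tr_accs))
    test_curve.append(np.mean(te_accs))

for n, tr, te in zip(sizes, train_curve, test_curve):
    print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")
```

Plotting train_curve and test_curve against sizes on the same axes follows the same matplotlib pattern as the earlier comparison plot.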